- Okay. Can everyone hear me? Okay. Sorry for the delay. I had a bit of technical difficulty. Today was the first time I was trying to use my new Touch Bar MacBook Pro for presenting, and none of the adapters were working, so I had to switch laptops at the last minute. Sorry about that.

So, today is lecture 10. We're talking about recurrent neural networks.

As usual, a couple of administrative notes. We're working hard on assignment one grading. Those grades will probably be out sometime later today; hopefully they can get out before the A2 deadline. That's what I'm hoping for. On a related note, assignment two is due today at 11:59 p.m. So, who's done with that already? About half of you guys. You'll remember I did warn you when the assignment went out that it was quite long and to start early, so you were warned about that. But hopefully you have some late days left.

Also, as another reminder, the midterm will be in class on Tuesday. If you look around the lecture hall, there are not enough seats in this room to seat all the enrolled students in the class, so we'll actually be having the midterm in several other lecture halls across campus, and we'll be sending out some more details on exactly where to go in the next couple of days.

Another bit of announcement: we've been working on a fun bit of extra credit for you to play with that we're calling the training game. This is a cool browser-based experience where you can go in and interactively train neural networks and tweak the hyperparameters during training. This should be a really cool interactive way for you to practice some of these hyperparameter tuning skills that we've been talking about the last couple of lectures. This is not required, but I think it will be a really useful experience to gain a little bit more intuition into how some of these hyperparameters work for different types of data sets in practice. We're still working on getting all the bugs worked out of this setup, and we'll probably send out some more instructions on exactly how this will work in the next couple of days. But again, not required. Please do check it out; I think it'll be really fun and a really cool thing for you to play with. And we will give a bit of extra credit if you end up working with it and doing a couple of runs. We'll again send out some more details about this soon once we get all the bugs worked out.

As a reminder, last time we were talking about CNN architectures.
We walked through the timeline of some of the various winners of the ImageNet classification challenge. The breakthrough result, as we saw, was the AlexNet architecture in 2012, which was an eight layer convolutional network. It did amazingly well, and it sort of kick-started this whole deep learning revolution in computer vision and brought a lot of these models into the mainstream.

Then we skipped ahead a couple of years and saw that in the 2014 ImageNet challenge we had these two really interesting models, VGG and GoogLeNet, which were much deeper. For VGG, they had a 16 and a 19 layer model, and GoogLeNet was, I believe, a 22 layer model. One thing that is kind of interesting about these models is that the 2014 ImageNet challenge was right before batch normalization was invented. So at this time, before the invention of batch normalization, training these relatively deep models of roughly twenty layers was very challenging. In fact, both of these models had to resort to a little bit of hackery in order to get their deep models to converge. For VGG, they had the 16 and 19 layer models, but actually they first trained an 11 layer model, because that was what they could get to converge, and then added some extra random layers in the middle and continued training, eventually training the 16 and 19 layer models. So managing this training process was very challenging in 2014, before the invention of batch normalization. Similarly, for GoogLeNet, we saw that GoogLeNet has these auxiliary classifiers that were stuck into the lower layers of the network. These were not really needed to get good classification performance; they were just a way to cause extra gradient to be injected directly into the lower layers of the network. Again, this was before the invention of batch normalization, and once you have these networks with batch normalization, you no longer need these slightly ugly hacks in order to get these deeper models to converge.

Then in the 2015 ImageNet challenge we saw this really cool model called ResNet, these residual networks that have shortcut connections and these little residual blocks, where we take our input, pass it through the residual block, and then add the input of the block to the output of its convolutional layers. This is kind of a funny architecture, but it actually has two really nice properties. One is that if we just set all the weights in this residual block to zero, then the block is computing the identity. So in some way, it's relatively easy for this model to learn not to use the layers that it doesn't need.
In addition, it adds a nice interpretation to L2 regularization in the context of these networks, because remember, once you put L2 regularization on the weights of your network, that's going to drive all the parameters towards zero. For a standard convolutional architecture, driving the weights towards zero maybe doesn't make much sense. But in the context of a residual network, if you drive all the parameters towards zero, that's kind of encouraging the model not to use layers that it doesn't need, because it will just drive those residual blocks towards the identity wherever they are not needed for classification.

The other really useful property of these residual networks has to do with gradient flow in the backward pass. If you remember what happens at these addition gates in the backward pass, when upstream gradient comes in through an addition gate, it splits and flows along the two different paths. So when upstream gradient comes in, it'll take one path through these convolutional blocks, but it will also have a direct connection for the gradient through this residual connection. So when you imagine stacking many of these residual blocks on top of each other, and our network ends up with potentially hundreds of layers, these residual connections give a sort of gradient super highway for gradients to flow backward through the entire network. This allows it to train much easier and much faster, and actually allows these things to converge reasonably well even when the model is potentially hundreds of layers deep. This idea of managing gradient flow in your models is actually super important everywhere in machine learning, and super prevalent in recurrent networks as well, so we'll definitely revisit this idea of gradient flow later in today's lecture.

We also saw a couple of other more exotic, more recent CNN architectures last time, including DenseNet and FractalNet, and once you think about these architectures in terms of gradient flow, they make a little bit more sense. Things like DenseNet and FractalNet are adding additional shortcut or identity connections inside the model, and if you think about what happens in the backward pass for these models, these additional funny topologies are basically providing direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network. So again, this idea of managing gradient flow properly in your CNN architectures is something that we've really seen a lot more in the last couple of years, and we will probably see more of moving forward as more exotic architectures are invented.
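To make that gradient flow argument a bit more concrete, here is a minimal numpy sketch of a single residual block. This is purely illustrative: the two small fully connected layers stand in for the convolutions of a real ResNet block, and the shapes and names are made up. The key point is the final addition, which copies the upstream gradient straight onto the shortcut path in the backward pass.

```python
import numpy as np

def residual_block_forward(x, W1, W2):
    """Compute F(x) + x for a toy two-layer block (fully connected layers
    stand in for the conv layers of a real ResNet block)."""
    a = np.maximum(0, x @ W1)      # first layer + ReLU
    f = a @ W2                     # second layer
    return f + x                   # the residual / shortcut addition

def residual_block_backward(dout, x, W1, W2):
    """Gradient of the block's output w.r.t. its input: the addition gate
    copies dout onto both paths, so the input receives dout directly (the
    'gradient super highway') plus whatever flows back through the block."""
    a = np.maximum(0, x @ W1)
    da = dout @ W2.T
    da[a <= 0] = 0                 # backprop through the ReLU
    dx_through_block = da @ W1.T
    return dx_through_block + dout # shortcut adds dout unchanged
```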
We also saw this kind of nice plot comparing the performance, the number of flops, the number of parameters, and the run time of these various models, and there are some interesting characteristics you can see from it. One is that VGG and AlexNet have a huge number of parameters, and these parameters come almost entirely from the fully connected layers of the models. AlexNet has something like roughly 62 million parameters, and if you look at the final fully connected layer in AlexNet, it's going from an activation volume of 6 x 6 x 256 into a fully connected vector of 4096. So if you imagine what the weight matrix needs to look like at that layer, it's gigantic: its number of entries is 6 x 6 x 256 x 4096, and if you multiply that out, you see that that single layer has about 38 million parameters. So more than half of the parameters of the entire AlexNet model are just sitting in that last fully connected layer. And if you add up all the parameters in just the fully connected layers of AlexNet, including the other fully connected layers, you see that something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers. When we move to other architectures, like GoogLeNet and ResNet, they do away with a lot of these large fully connected layers in favor of global average pooling at the end of the network, and this allows these nicer architectures to really cut down their parameter counts.

So that was our brief recap of the CNN architectures that we saw last lecture, and today we're going to move to one of my favorite topics to talk about, which is recurrent neural networks.

So far in this class, we've seen what I like to think of as a kind of vanilla feed-forward network. All of our network architectures have this flavor, where we receive some input, and that input is a fixed size object, like an image or a vector. That input is fed through some set of hidden layers and produces a single output, like a set of classification scores over a set of categories. But in some contexts in machine learning, we want to have more flexibility in the types of data that our models can process. Once we move to this idea of recurrent neural networks, we have a lot more opportunities to play around with the types of input and output data that our networks can handle. Once we have recurrent neural networks, we can do what we call one to many models,
where maybe our input is some object of fixed size, like an image, but now our output is a sequence of variable length, such as a caption. Different captions might have different numbers of words, so our output needs to be variable in length.

We also might have many to one models, where our input could be variably sized. This might be something like a piece of text, and we want to say what the sentiment of that text is, whether it's positive or negative. Or in a computer vision context, you might imagine taking as input a video, which might have a variable number of frames, and now we want to read this entire video of potentially variable length and then, at the end, make a classification decision about what kind of activity or action is going on in that video.

We might also have problems where we want both the input and the output to be variable in length. We might see something like this in machine translation, where our input is maybe a sentence in English, which could have a variable length, and our output is maybe a sentence in French, which also could have a variable length. And crucially, the length of the English sentence might be different from the length of the French sentence, so we need models that have the capacity to accept variable length sequences on both the input and the output.

Finally, we might also consider problems where our input is variable in length, like a video sequence with a variable number of frames, and now we want to make a decision for each element of that input sequence. In the context of videos, that might be making some classification decision for every frame of the video.

Recurrent neural networks are this kind of general paradigm for handling variable sized sequence data that lets us pretty naturally capture all of these different types of setups in our models.

Recurrent neural networks are actually important even for some problems that have a fixed size input and a fixed size output; they can still be pretty useful. In this example, we might want to do, for example, sequential processing of our input. Here, we're receiving a fixed size input like an image, and we want to make a classification decision about what number is being shown in the image. But now, rather than just doing a single feed-forward pass and making the decision all at once, this network is actually looking around the image and taking various glimpses of different parts of the image, and then after making some series of glimpses, it makes its final decision as to what number is present.
So here, even though our input was an image and our output was a classification decision, even in this context, this idea of being able to handle variable length processing with recurrent neural networks can lead to some really interesting types of models.

There's a really cool paper that I like that applied this same type of idea to generating new images, where now we want the model to synthesize brand new images that look kind of like the images it saw in training, and we can use a recurrent neural network architecture to actually paint these output images sort of one piece at a time. You can see that, even though our output is a fixed size image, we can have these models that work over time to compute parts of the output sequentially, one at a time. We can use recurrent neural networks for that type of setup as well.

So after this pitch about all the cool things that RNNs can do, you might wonder, what exactly are these things? In general, a recurrent neural network has this little recurrent core cell, and it will take some input x, feed that input into the RNN, and that RNN has some internal hidden state, and that internal hidden state will be updated every time the RNN reads a new input. That internal hidden state will then be fed back to the model the next time it reads an input. And frequently, we will want our RNNs to also produce some output at every time step, so we'll have this pattern where it will read an input, update its hidden state, and then produce an output.

So then the question is, what is the functional form of this recurrence relation that we're computing? Inside this little green RNN block, we're computing some recurrence relation with a function f. This function f will depend on some weights, W. It will accept the previous hidden state, h_{t-1}, as well as the input at the current time step, x_t, and it will output the next hidden state, or the updated hidden state, which we call h_t. Then, as we read the next input, this new hidden state h_t will just be passed into the same function along with the next input, x_{t+1}. And now, if we wanted to produce some output at every time step of this network, we might attach some additional fully connected layers that read in this h_t at every time step and make that decision based on the hidden state at every time step. One thing to note is that we use the same function f_W, and the same weights W, at every time step of the computation.

So then the simplest functional form that you can imagine is what we call the vanilla recurrent neural network.
Here we have the same functional form from the previous slide, where we take our previous hidden state and our current input and we need to produce the next hidden state. The simplest thing you might imagine is that we have some weight matrix, W_xh, that we multiply against the input x_t, as well as another weight matrix, W_hh, that we multiply against the previous hidden state. We make these two multiplications, add them together, and squash the result through a tanh, so we get some kind of non-linearity in the system. You might be wondering why we use a tanh here and not some other type of non-linearity, after all the negative things we've said about tanh in previous lectures, and I think we'll return to that a little bit later on when we talk about more advanced architectures like the LSTM.

In addition, in this architecture, if we wanted to produce some y_t at every time step, you might have another weight matrix that accepts this hidden state and transforms it to some y_t, to produce maybe some class score predictions at every time step.

You can think of recurrent neural networks in two ways. One is this concept of having a hidden state that feeds back into itself, recurrently. But I find that picture a little bit confusing, and sometimes I find it clearer to think about unrolling this computational graph over multiple time steps. This makes the data flow of the hidden states, the inputs, the outputs, and the weights maybe a little bit more clear. At the first time step, we'll have some initial hidden state, h_0. This is usually initialized to zeros in most contexts, and then we'll have some input, x_1. This initial hidden state h_0 and our current input x_1 will go into our f_W function, and this will produce our next hidden state, h_1. Then we'll repeat this process when we receive the next input: our current h_1 and our x_2 go into that same f_W to produce our next hidden state, h_2. And this process repeats over and over again as we consume all of the inputs x_t in our sequence.

One thing to note is that we can actually make this even more explicit and write the W matrix in our computational graph. Here you can see that we're re-using the same W matrix at every time step of the computation. So every time we have this little f_W block, it's receiving a unique h and a unique x, but all of these blocks are taking the same W.
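As a minimal sketch of what this vanilla recurrence looks like in code (numpy, with made-up names and shapes), here is one time step plus the unrolled forward pass over a whole sequence; note that the same weight matrices are reused at every step:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One step of a vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh),
    plus an output y_t = Why h_t + by (e.g. unnormalized class scores)."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)
    y = Why @ h + by
    return h, y

def rnn_forward(xs, h0, params):
    """Unroll the graph over an input sequence, reusing the same weights
    at every time step."""
    Wxh, Whh, Why, bh, by = params
    h, hs, ys = h0, [], []
    for x in xs:                       # xs: list of input vectors x_1..x_T
        h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)
        hs.append(h)
        ys.append(y)
    return hs, ys
```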
If you remember how gradient flows in backpropagation, when you re-use the same node multiple times in a computational graph, then during the backward pass you end up summing the gradients into the W matrix when computing dLoss/dW. So if you think about backpropagation for this model, you'll have a separate gradient for W flowing from each of those time steps, and the final gradient for W will be the sum of all of those individual per-time-step gradients.

We can also write this y_t explicitly in the computational graph. This output h_t at every time step might feed into some other little neural network that produces a y_t, which might be some class scores, or something like that, at every time step. We can also make the loss more explicit. In many cases, you might imagine that you have some ground truth label at every time step of your sequence, and then you'll compute some individual loss at every time step on these outputs y_t. This loss will frequently be something like a softmax loss, in the case where you have a ground truth label at every time step of the sequence. And now the final loss for this entire training step will be the sum of these individual losses. So we have a scalar loss at every time step, and we just sum them up to get our final scalar loss at the top of the network. If you think again about backpropagation through this thing, in order to train the model we need to compute the gradient of the loss with respect to W. We'll have loss flowing from that final loss into each of these time steps, and then each of those time steps will compute a local gradient on the weights W, which will all then be summed to give us our final gradient for the weights W.

Now if we have this sort of many to one situation, where maybe we want to do something like sentiment analysis, then we would typically make that decision based on the final hidden state of the network, because this final hidden state kind of summarizes all of the context from the entire sequence.

Also, if we have a kind of one to many situation, where we want to receive a fixed size input and then produce a variably sized output, then you'll commonly use that fixed size input to somehow initialize the initial hidden state of the model, and now the recurrent network will tick for each cell in the output. As you produce your variably sized output, you'll unroll the graph for each element in the output.
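Continuing the earlier sketch, and assuming we have an integer target label at every time step (as in the per-step softmax loss just described), here is roughly what the summed loss and the backward pass look like. Because the same weights are reused at every time step, each weight gradient is accumulated, with +=, across all the steps:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_loss_and_grads(xs, targets, h0, params):
    """Forward through the whole sequence, sum a softmax loss at every time
    step, then backpropagate through time. Wxh/Whh/Why are reused at every
    step, so their gradients are summed over all steps."""
    Wxh, Whh, Why, bh, by = params
    hs, ps, loss = {-1: h0}, {}, 0.0
    for t, x in enumerate(xs):
        hs[t] = np.tanh(Wxh @ x + Whh @ hs[t - 1] + bh)
        ps[t] = softmax(Why @ hs[t] + by)
        loss += -np.log(ps[t][targets[t]])          # per-step softmax loss

    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1      # d loss_t / d scores_t
        dWhy += np.outer(dy, hs[t]); dby += dy      # summed across time steps
        dh = Why.T @ dy + dh_next                   # from the output and the future
        draw = (1 - hs[t] ** 2) * dh                # backprop through tanh
        dWxh += np.outer(draw, xs[t]); dWhh += np.outer(draw, hs[t - 1]); dbh += draw
        dh_next = Whh.T @ draw
    return loss, (dWxh, dWhh, dWhy, dbh, dby)
```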
When we talk about sequence to sequence models, where you might do something like machine translation, taking a variably sized input to a variably sized output, you can think of this as a combination of the many to one plus the one to many. We'll proceed in two stages, what we call an encoder and a decoder. For the encoder, we receive the variably sized input, which might be your sentence in English, and then summarize that entire sentence using the final hidden state of the encoder network. Now we're in this many to one situation, where we've summarized the entire variably sized input in a single vector. Then we have a second decoder network, which is a one to many situation, which takes as input that single vector summarizing the input sentence and produces a variably sized output, which might be your sentence in another language. In this variably sized output, we might make predictions at every time step, maybe about which word to use. And you can imagine training this entire thing by unrolling the computational graph, summing the losses over the output sequence, and just performing backpropagation as usual.

As a concrete example, one thing that we frequently use recurrent neural networks for is this problem called language modeling. In the language modeling problem, we want our network to understand, sort of, how to produce natural language. This might happen at the character level, where our model produces characters one at a time, or it might happen at the word level, where our model produces words one at a time. In a very simple example, you can imagine a character level language model where the network reads some sequence of characters and then needs to predict what the next character will be in this stream of text. In this example, we have this very small vocabulary of four letters, h, e, l, and o, and we have this example training sequence of the word "hello". During training, when we're training this language model, we will feed the characters of this training sequence as inputs; these will be the x_t's that we feed in as the inputs to our recurrent neural network. Each of these inputs is a letter, and we need to figure out a way to represent letters in our network. What we'll typically do is figure out what our total vocabulary is. In this case, our vocabulary has four elements.
Each letter will be represented by a vector that has zeros in every slot but one, and a one in the slot of the vocabulary corresponding to that letter. In this little example, since our vocab has the four letters h, e, l, o, the h in our input sequence is represented by a four element vector with a one in the first slot and zeros in the other three slots. And we use the same pattern to represent all the different letters in the input sequence.

Now, during the forward pass, at the first time step the network will receive the input letter h. That will go into the first RNN cell, and then we'll produce this output, y_t, which is the network making predictions about which letter in the vocabulary it thinks is most likely to come next. In this example, the correct next letter is e, because our training sequence was "hello", but the model is actually predicting, I think, o as the most likely letter. So in this case the prediction was wrong, and we would use the softmax loss to quantify our unhappiness with these predictions. At the next time step, we would feed in the second letter in the training sequence, e, and this process repeats: we represent e as a vector, use that input vector together with the previous hidden state to produce a new hidden state, and then use the second hidden state to again make predictions over every letter in the vocabulary. In this case, because our training sequence was "hello", after the letter e we want our model to predict l. Here, our model may have a very low prediction for the letter l, so we would incur a high loss.

You repeat this process over and over, and if you train this model with many different sequences, then eventually it should learn how to predict the next character in a sequence based on the context of all the previous characters that it has seen before.

Now, if you think about what happens at test time, after we train this model, one thing that we might want to do with it is sample from the model, and actually use this trained neural network model to synthesize new text that looks similar in spirit to the text that it was trained on. The way this works is that we'll typically seed the model with some input prefix of text. In this case, the prefix is just the single letter h, and we'll feed that letter h through the first time step of our recurrent neural network. It will produce this distribution of scores over all the characters in the vocabulary. Now, at test time, we'll use these scores to actually sample from it.
We'll use a softmax function to convert those scores into a probability distribution, and then we will sample from that probability distribution to actually synthesize the second letter in the sequence. In this case, even though the scores were pretty bad, maybe we got lucky and sampled the letter e from this probability distribution. Now we'll take this letter e that was sampled from the distribution and feed it back as input into the network at the next time step. We take the e, pull it down from the top, feed it back into the network as one of these one-hot vector representations, and then repeat the process in order to synthesize the next letter in the output. We can repeat this process over and over again to synthesize a new sequence using this trained model, where we're synthesizing the sequence one character at a time using these predicted probability distributions at each time step.

Question?

Yeah, that's a great question. The question is, why might we sample instead of just taking the character with the largest score? In this case, because of the probability distribution that we had, it was impossible to get the right character, so we had to sample so the example could work out and make sense. But in practice, sometimes you'll see both. Sometimes you'll just take the argmax probability, and that will sometimes be a little bit more stable. But one advantage of sampling in general is that it lets you get diversity from your models. Sometimes you might have the same input, maybe the same prefix, or in the case of image captioning, maybe the same image, but if you sample rather than taking the argmax, you'll see that these trained models are sometimes able to produce multiple different reasonable output sequences, depending on which samples they take at the first time steps. That's actually kind of a benefit, because we can now get more diversity in our outputs.

Another question?

Could we feed in the softmax vector instead of the one-hot vector? You mean at test time?

Yeah, so the question is, at test time, could we feed in this whole softmax vector rather than a one-hot vector? There are kind of two problems with that. One is that that's very different from the data the model saw at training time. In general, if you ask your model to do something at test time that is different from training time, then it'll usually blow up: it'll usually give you garbage and you'll usually be sad. The other problem is that in practice, our vocabularies might be very large.
Maybe in this simple example our vocabulary is only four elements, so it's not a big problem. But if you're thinking about generating words one at a time, now your vocabulary is every word in the English language, which could be something like tens of thousands of elements. So in practice, this first operation that takes in the one-hot vector is often performed using sparse vector operations rather than dense vectors. It would be computationally really bad if you wanted to feed in this whole 10,000-element softmax vector. So that's usually why we use a one-hot vector instead, even at test time.

This idea that we have a sequence, produce an output at every time step of the sequence, and then finally compute some loss is sometimes called backpropagation through time, because you're imagining that in the forward pass you're stepping forward through time, and then during the backward pass you're going backwards through time to compute all your gradients. This can actually be kind of problematic if you want to train on sequences that are very, very long. If you imagine that we were trying to train a neural network language model on maybe the entire text of Wikipedia, which is, by the way, something that people do pretty frequently, this would be super slow: every time we made a gradient step, we would have to make a forward pass through the entire text of all of Wikipedia, then make a backward pass through all of Wikipedia, and then make a single gradient update. That would be super slow, your model would never converge, and it would also take a ridiculous amount of memory, so this would be just really bad.

In practice, what people do is this approximation called truncated backpropagation through time. Here the idea is that, even though our input sequence is very, very long, and even potentially infinite, when we're training the model we'll step forward for some number of steps, maybe something like a hundred, which is a ballpark number that people frequently use, compute a loss only over that sub-sequence of the data, backpropagate through that sub-sequence, and then make a gradient step. Now when we repeat, we still have the hidden states that we computed from the first batch, and when we process the next batch of data, we carry those hidden states forward in time, so the forward pass will be exactly the same. But when we compute a gradient step for this next batch of data, we will only backpropagate through this second batch. We'll make a gradient step based on this truncated backpropagation through time.
This process continues: when we take the next batch, we'll again copy these hidden states forward, but then step forward and step backward for only some small number of time steps. You can think of this as being analogous to stochastic gradient descent in the case of sequences. Remember, when we talked about training our models on large data sets, it would be super expensive to compute the gradients over every element in the data set, so instead we take small mini batches and use mini batches of data to compute gradient steps, just as in the image classification case.

Question?

The question is, is this kind of making a Markov assumption? No, not really, because we're carrying this hidden state forward in time forever. It's making a Markovian assumption in the sense that, conditioned on the hidden state, the hidden state is all that we need to predict the entire future of the sequence. But that assumption is built into the recurrent neural network formula from the start, and it's not really particular to backpropagation through time. Truncated backpropagation through time is just a way to approximate these gradients without making a backward pass through your potentially very large sequence of data.

This all sounds very complicated and confusing, and it sounds like a lot of code to write, but in fact this can actually be pretty concise. Andrej has this example that he calls min-char-rnn, which does all of this stuff in just about 112 lines of Python. It handles building the vocabulary, it trains the model with truncated backpropagation through time, and it can actually sample from that model, all in not too much code. So even though this sounds like a big, scary process, it's actually not too difficult. I'd encourage you, if you're confused, to go check this out and step through the code on your own time, and see all of these concrete steps happening in code. It's all in just a single file, using numpy with no dependencies, and it's relatively easy to read.

So then, once we have this idea of training a recurrent neural network language model, we can actually have a lot of fun with this. We can take in, sort of, any text that we want, whatever random text you can think of from the internet, train our recurrent neural network language model on this text, and then generate new text.
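Before looking at those examples, here is a minimal sketch, in the spirit of min-char-rnn but not its actual code, of what such a truncated backpropagation through time loop and the test-time sampling loop might look like. The helper `rnn_chunk_loss_grads_state`, the chunk length of 100, and the plain SGD update are illustrative assumptions; the helper is assumed to behave like the loss-and-gradients sketch earlier, but to also return the hidden state after the last step of the chunk.

```python
import numpy as np

def train_truncated_bptt(data_ids, params, hidden_size, seq_length=100, lr=1e-1):
    """Truncated BPTT sketch: the hidden state is carried forward across chunks
    (so the forward pass sees the whole history), but gradients are only
    backpropagated within each chunk of seq_length steps."""
    h = np.zeros(hidden_size)                       # carried across chunks, never reset
    for start in range(0, len(data_ids) - seq_length, seq_length):
        inputs  = data_ids[start : start + seq_length]
        targets = data_ids[start + 1 : start + seq_length + 1]
        # assumed helper: like the loss/gradient sketch above, but also
        # returning the hidden state after the last step of this chunk
        loss, grads, h = rnn_chunk_loss_grads_state(inputs, targets, h, params)
        for p, g in zip(params, grads):             # SGD step from this chunk only
            p -= lr * g

def sample(h, seed_id, n, params, vocab_size):
    """Generate n character ids one at a time, feeding each sampled character
    back in as a one-hot input at the next time step."""
    Wxh, Whh, Why, bh, by = params
    x = np.zeros(vocab_size); x[seed_id] = 1
    out = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        s = Why @ h + by
        p = np.exp(s - s.max()); p /= p.sum()       # softmax over the vocabulary
        idx = np.random.choice(vocab_size, p=p)     # sample rather than argmax
        x = np.zeros(vocab_size); x[idx] = 1        # feed the choice back as one-hot
        out.append(idx)
    return out
```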
In this example, we took the entire text of all of Shakespeare's works and used it to train a recurrent neural network language model on all of Shakespeare. You can see that at the beginning of training it produces maybe random gibberish garbage, but over the course of training it ends up producing things that seem relatively reasonable. After this model has been trained pretty well, it produces text that seems kind of Shakespeare-esque to me. "Why do what that day," replied, whatever, right, you can read this; it kind of looks like Shakespeare. And if you actually train this model even more, and let it converge even further, and then sample even longer sequences, you can see that it learns all kinds of crazy cool stuff that really looks like a Shakespeare play. It knows to use these headings to say who's speaking, it produces these bits of text with dialogue that sounds kind of Shakespeare-esque, and it knows to put line breaks in between these different things. This is all really cool, all just learned from the structure of the data.

We can actually get even crazier than this. This was one of my favorite examples I found online. Is anyone a mathematician in this room? Has anyone taken an algebraic topology course, by any chance? Wow, a couple, that's impressive. So you probably know more algebraic topology than me, but I found this open source algebraic topology textbook online. It's just a whole bunch of TeX files with this super dense mathematics, and because LaTeX lets you write equations and diagrams and everything just using plain text, we can actually train our recurrent neural network language model on the raw LaTeX source code of this algebraic topology textbook. And if we do that, then when we sample from the model, we get something that looks kind of like algebraic topology. It knows to put in equations; it puts in all kinds of crazy stuff. It says things like, "to prove study, we see that F sub U is a covering of x prime," blah, blah, blah. It knows where to put unions. It knows to put squares at the end of proofs. It makes lemmas. It makes references to previous lemmas. Here, "it's namely a bi-lemma question, we see that R is geometrically" something. So it's actually pretty crazy. It also sometimes tries to make diagrams.
For those of you who have taken algebraic topology, you know that these commutative diagrams are kind of a thing you work with a lot, so it got the general gist of how to make those diagrams, but they actually don't make any sense. And actually, one of my favorite examples here is that it sometimes omits proofs: it'll sometimes say something like, theorem, blah, blah, blah, proof omitted. This thing has kind of gotten the gist of what some of these math textbooks look like.

We can have a lot of fun with this. We also tried training one of these models on the entire source code of the Linux kernel, because again, this is character level stuff that we can train on. When we sample from this, it actually again looks like C source code. It knows how to write if statements. It has pretty good code formatting skills: it knows to indent after these if statements, and it knows to put curly braces. It even makes comments about some things, which are usually nonsense. One problem with this model is that it knows how to declare variables, but it doesn't always use the variables that it declares, and sometimes it tries to use variables that haven't been declared, so this wouldn't compile. I would not recommend sending this as a pull request to Linux. This thing also figures out how to recite the GNU license, character by character. It kind of knows that you need to recite the GNU license, and after the license come some includes, then some other includes, then source code. This thing has actually learned quite a lot about the general structure of the data, where, again, during training, all we asked this model to do was try to predict the next character in the sequence. We didn't tell it any of this structure, but somehow, just through the course of this training process, it learned a lot about the latent structure in this sequential data.

Yeah, so it knows how to write code; it does a lot of cool stuff. I had a paper with Andrej a couple of years ago where we trained a bunch of these models and then wanted to try to poke into the brains of these models and figure out what they are doing and why they are working. We saw that these recurrent neural networks have this hidden vector that's updated at every time step, and what we wanted to figure out is whether we could find some elements of this vector that have some semantically interpretable meaning.
So what we did is we trained a neural network language model, one of these character level models, on one of these data sets, and then we picked one of the elements in that hidden vector and looked at the value of that element over the course of a sequence, to try to get some sense of what these different hidden states might be looking for. When you do this, a lot of them end up looking kind of like random gibberish garbage. So here, what we've done is we've picked one element of that vector, and now we run the sequence forward through the trained model, and the color of each character corresponds to the magnitude of that single scalar element of the hidden vector at every time step as it reads the sequence. You can see that a lot of the elements in these hidden states are not very interpretable; it seems like they're doing some of this low level language modeling to figure out what character should come next.

But some of them end up quite nice. Here we found this vector that is looking for quotes. You can see that there's one element in the vector that is mostly off, shown in blue, and then once it hits a quote, it turns on and remains on for the duration of the quote. And when we hit the second quotation mark, that cell turns off. So somehow, even though this model was only trained to predict the next character in a sequence, it learned that a useful thing to have, in order to do this, might be some cell that's trying to detect quotes. We also found this other cell that looks like it's counting the number of characters since a line break. You can see that at the beginning of each line, this element starts off at zero; throughout the course of the line it gets gradually more red, so that value increases; and then after the newline character, it resets to zero. You can imagine that maybe this cell is letting the network keep track of when it needs to produce these newline characters. When we trained on the Linux source code, we also found some examples that turn on inside the conditions of if statements, which maybe allows the network to differentiate whether it's outside an if statement or inside the condition, and which might help it model these sequences better. We also found some that turn on inside comments, and some that seem like they're counting the number of indentation levels. This is all really cool stuff, because it's saying that even though we were only trying to train this model to predict next characters, it somehow ends up learning a lot of useful structure about the input data.
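The probing itself is conceptually very simple. As a rough sketch (reusing a step function like `rnn_step` from the earlier sketch, and assuming a trained set of weights, a character-to-one-hot encoder, and a chosen unit index; the colorized rendering from the paper is replaced here by just returning the raw values):

```python
def trace_hidden_unit(text, unit, h0, encode_one_hot, step_fn, params):
    """Run a trained character RNN over a piece of text and record the value
    of one chosen element of the hidden state at every character, so we can
    inspect what (if anything) that unit responds to."""
    h, trace = h0, []
    for ch in text:
        h, _ = step_fn(encode_one_hot(ch), h, *params)   # reuse the trained step
        trace.append(h[unit])                            # one scalar per character
    return trace  # e.g. plot or color-code these values against the text
```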
One thing we often use these for: this has not really been computer vision so far, and we need to pull it back to computer vision since this is a vision class. We've alluded many times to this image captioning model, where we want to build models that take as input an image and then output a caption in natural language. There were a bunch of papers a couple of years ago that all had relatively similar approaches, but I'm showing the figure from the paper from our lab, in a totally unbiased way.

The idea here is that the caption is a variable length sequence, and the sequence might have different numbers of words for different captions, so this is a totally natural fit for a recurrent neural network language model. What this model looks like is that we have some convolutional network that takes the image as input, and we've seen a lot about how convolutional networks work at this point, and that convolutional network produces a summary vector of the image, which then feeds into the first time step of one of these recurrent neural network language models, which will then produce the words of the caption one at a time.

The way this works at test time, after the model is trained, looks almost exactly the same as the character level language models that we saw a little bit ago. We'll take our input image and feed it through our convolutional network, but now instead of taking the softmax scores from an ImageNet model, we'll instead take the 4096 dimensional vector from the end of the model, and we'll use that vector to summarize the whole content of the image. Now, remember when we talked about RNN language models, we said that we need to seed the language model with some first initial input to tell it to start generating text. In this case, we'll give it some special start token, which is just saying, hey, this is the start of a sentence, please start generating some text conditioned on this image information. Previously, we saw that in the RNN language model we had these matrices that take the input at the current time step and the hidden state at the previous time step and combine them to get the next hidden state. Well, now we also need to add in this image information. People play around with different ways to incorporate this image information, but one simple way is just to add a third weight matrix that adds in this image information at every time step when computing the next hidden state. Now we'll compute this distribution over the scores for our whole vocabulary, and here our vocabulary is something like all English words, so it could be pretty large.
406 00:41:27,307 --> 00:41:32,099 We'll sample from that distribution and now pass that word back as input at the next time step. 407 00:41:32,099 --> 00:41:39,546 That will feed that word in, again get a distribution over all words in the vocab, and again sample to produce the next word. 408 00:41:39,546 --> 00:41:47,538 So then, after that is all done, we'll have generated this complete sentence. 409 00:41:47,538 --> 00:41:54,347 We stop generation once we sample the special end token, which kind of corresponds to the period at the end of the sentence. 410 00:41:54,347 --> 00:42:00,328 Then once the network samples this end token, we stop generation and we're done and we've gotten our caption for this image. 411 00:42:00,328 --> 00:42:06,739 And during training, we put an end token at the end of every caption, 412 00:42:06,739 --> 00:42:10,907 so that the network kind of learned during training that end tokens come at the end of sequences. 413 00:42:10,907 --> 00:42:16,848 So then, during test time, it tends to sample these end tokens once it's done generating. 414 00:42:16,848 --> 00:42:20,139 So we trained this model in kind of a completely supervised way. 415 00:42:20,139 --> 00:42:24,763 You can find data sets that have images together with natural language captions. 416 00:42:24,763 --> 00:42:28,666 Microsoft COCO is probably the biggest and most widely used for this task. 417 00:42:28,666 --> 00:42:47,837 So you just train this model in a purely supervised way, and then backpropagate to jointly train this recurrent neural network language model and also pass gradients back into the final layer of the CNN, additionally updating the weights of the CNN to jointly tune all parts of the model to perform this task. 418 00:42:47,837 --> 00:42:52,438 Once you train these models, they actually do some pretty reasonable things. 419 00:42:52,438 --> 00:42:56,689 These are some real results from one of these trained models, 420 00:42:56,689 --> 00:43:00,995 and it says things like a cat sitting on a suitcase on the floor, 421 00:43:00,995 --> 00:43:02,583 which is pretty impressive. 422 00:43:02,583 --> 00:43:04,975 It knows about cats sitting on a tree branch, 423 00:43:04,975 --> 00:43:06,499 which is also pretty cool. 424 00:43:06,499 --> 00:43:09,631 It knows about two people walking on the beach with surfboards. 425 00:43:09,631 --> 00:43:16,635 So these models are actually pretty powerful and can produce relatively complex captions to describe the image. 426 00:43:16,635 --> 00:43:19,808 But that being said, these models are really not perfect. 427 00:43:19,808 --> 00:43:24,319 They're not magical. Just like any machine learning model, 428 00:43:24,319 --> 00:43:28,830 if you try to run them on data that was very different from the training data, they don't work very well. 429 00:43:28,830 --> 00:43:33,317 So for example, in this one it says a woman is holding a cat in her hand. 430 00:43:33,317 --> 00:43:35,421 There's clearly no cat in the image. 431 00:43:35,421 --> 00:43:41,359 But she is wearing a fur coat, and maybe the texture of that coat kind of looked like a cat to the model. 432 00:43:41,359 --> 00:43:44,379 Over here, we see a woman standing on a beach holding a surfboard.
433 00:43:44,379 --> 00:43:47,892 Well, she's definitely not holding a surfboard and she's doing a handstand, 434 00:43:47,892 --> 00:43:51,820 which is maybe the interesting part of that image, and the model totally missed that. 435 00:43:51,820 --> 00:43:58,238 Also, over here, we see this example where there's this picture of a spider web on a tree branch, 436 00:43:58,238 --> 00:44:00,382 and it says something like a bird sitting on a tree branch. 437 00:44:00,382 --> 00:44:05,139 So it totally missed the spider, but during training, it never really saw examples of spiders. 438 00:44:05,139 --> 00:44:08,217 It just knows that birds sit on tree branches during training. 439 00:44:08,217 --> 00:44:10,152 So it kind of makes these reasonable mistakes. 440 00:44:10,152 --> 00:44:14,338 Or here at the bottom, it can't really tell the difference between this guy throwing and catching the ball, 441 00:44:14,338 --> 00:44:17,709 but it does know that it's a baseball player and there are balls and things involved. 442 00:44:17,709 --> 00:44:20,441 So again, just want to say that these models are not perfect. 443 00:44:20,441 --> 00:44:25,109 They work pretty well when you ask them to caption images that were similar to the training data, 444 00:44:25,109 --> 00:44:29,188 but they definitely have a hard time generalizing far beyond that. 445 00:44:29,188 --> 00:44:34,152 So another thing you'll sometimes see is this slightly more advanced model called attention, 446 00:44:34,152 --> 00:44:41,036 where now when we're generating the words of this caption, we can allow the model to steer its attention to different parts of the image. 447 00:44:41,036 --> 00:44:44,670 And I don't want to spend too much time on this. 448 00:44:44,670 --> 00:44:48,925 But the general way that this works is that now our convolutional network, 449 00:44:48,925 --> 00:44:59,954 rather than producing a single vector summarizing the entire image, now produces some grid of vectors, maybe one vector for each spatial location in the image. 450 00:44:59,954 --> 00:45:06,618 And now, when this model runs forward, in addition to sampling from the vocabulary at every time step, 451 00:45:06,618 --> 00:45:11,263 it also produces a distribution over the locations in the image where it wants to look. 452 00:45:11,263 --> 00:45:18,276 And now this distribution over image locations can be seen as a kind of attention over where the model should look. 453 00:45:18,276 --> 00:45:23,155 So now that first hidden state computes this distribution over image locations, 454 00:45:23,155 --> 00:45:31,372 which then goes back to the set of vectors to give a single summary vector that maybe focuses the attention on one part of that image. 455 00:45:31,372 --> 00:45:36,508 And now that summary vector gets fed, as an additional input, at the next time step of the neural network. 456 00:45:36,508 --> 00:45:38,278 And now again, it will produce two outputs. 457 00:45:38,278 --> 00:45:43,714 One is our distribution over vocabulary words. And the other is a distribution over image locations. 458 00:45:43,714 --> 00:45:49,160 This whole process will continue, and it will sort of do these two different things at every time step. 459 00:45:50,296 --> 00:45:58,298 And after you train the model, then you can see that it kind of will shift its attention around the image for every word that it generates in the caption.
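Here is a minimal sketch of one soft-attention readout step of the kind just described: a grid of CNN feature vectors is combined into a single summary vector using a distribution over locations computed from the hidden state. The names, shapes, and the particular scoring function are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_attention_step(h_prev, features, Watt):
    """features: (L, D) -- one CNN feature vector per spatial location
    h_prev:   (H,)   -- current RNN hidden state
    Watt:     (L, H) -- hypothetical weights mapping the hidden state to one
                        unnormalized attention score per location."""
    scores = Watt @ h_prev                # (L,) one score per image location
    alpha = softmax(scores)               # distribution over locations
    context = alpha @ features            # (D,) weighted sum = soft attention
    return context, alpha                 # context is fed as an extra input next step
```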
460 00:45:58,298 --> 00:46:04,436 Here you can see that it produced the caption, a bird is flying over, I can't see that far. 461 00:46:04,436 --> 00:46:12,473 But you can see that its attention is shifting around different parts of the image for each word in the caption that it generates. 462 00:46:14,112 --> 00:46:19,147 There's this notion of hard attention versus soft attention, which I don't really want to get into too much, 463 00:46:19,147 --> 00:46:26,614 but with this idea of soft attention, we're kind of taking a weighted combination of all features from all image locations, 464 00:46:26,614 --> 00:46:34,259 whereas in the hard attention case, we're forcing the model to select exactly one location to look at in the image at each time step. 465 00:46:34,259 --> 00:46:38,277 So the hard attention case, where we're selecting exactly one image location, is a little bit tricky 466 00:46:38,277 --> 00:46:46,247 because that is not really a differentiable function, so you need to do something slightly fancier than vanilla backpropagation in order to train the model in that scenario. 467 00:46:46,247 --> 00:46:51,498 And I think we'll talk about that a little bit later in the lecture on reinforcement learning. 468 00:46:51,498 --> 00:46:57,415 Now, after you train one of these attention models and run it to generate captions, 469 00:46:57,415 --> 00:47:04,067 you can see that it tends to focus its attention on maybe the salient or semantically meaningful parts of the image when generating captions. 470 00:47:04,067 --> 00:47:08,413 You can see that the caption was a woman is throwing a frisbee in a park, 471 00:47:08,413 --> 00:47:14,516 and you can see that this attention mask, when the model generated the word frisbee, 472 00:47:14,516 --> 00:47:18,976 was focusing its attention on this image region that actually contains the frisbee. 473 00:47:18,976 --> 00:47:24,542 This is actually really cool. We did not tell the model where it should be looking at every time step. 474 00:47:24,542 --> 00:47:27,955 It sort of figured all that out for itself during the training process. 475 00:47:27,955 --> 00:47:32,721 It somehow figured out that looking at that image region was the right thing to do for this image. 476 00:47:32,721 --> 00:47:34,610 And because everything in this model is differentiable, 477 00:47:34,610 --> 00:47:42,541 because we can backpropagate through all these soft attention steps, all of this soft attention stuff just comes out through the training process. 478 00:47:42,541 --> 00:47:45,297 So that's really, really cool. 479 00:47:45,297 --> 00:47:51,565 By the way, this idea of recurrent neural networks and attention actually gets used in other tasks beyond image captioning. 480 00:47:51,565 --> 00:47:54,691 One recent example is this idea of visual question answering. 481 00:47:54,691 --> 00:48:04,010 So here, our model is going to take two things as input. It's going to take an image and it will also take a natural language question that's asking some question about the image. 482 00:48:04,010 --> 00:48:10,163 Here, we might see this image on the left and we might ask the question, what endangered animal is featured on the truck? 483 00:48:10,163 --> 00:48:19,027 And now the model needs to select from one of these four natural language answers about which of these answers correctly answers that question in the context of the image.
484 00:48:19,027 --> 00:48:24,774 So you can imagine kind of stitching this model together using CNNs and RNNs in kind of a natural way. 485 00:48:24,774 --> 00:48:29,701 Now, we're in this many to one scenario, 486 00:48:29,701 --> 00:48:33,566 where now our model needs to take as input this natural language sequence, 487 00:48:33,566 --> 00:48:40,111 so we can imagine running a recurrent neural network over each element of that input question, to now summarize the input question in a single vector. 488 00:48:40,111 --> 00:48:43,928 And then we can have a CNN to again summarize the image, 489 00:48:43,928 --> 00:48:51,645 and now combine both the vector from the CNN and the vector from the question-encoding RNN to then predict a distribution over answers. 490 00:48:51,645 --> 00:48:59,175 You'll also sometimes see this idea of soft spatial attention being incorporated into things like visual question answering. 491 00:48:59,175 --> 00:49:07,062 So you can see that here, this model also has spatial attention over the image when it's trying to determine answers to the questions. 492 00:49:07,062 --> 00:49:09,062 Just to, yeah, question? 493 00:49:18,715 --> 00:49:20,878 So the question is: how are the different inputs combined? 494 00:49:20,878 --> 00:49:25,998 Do you mean like the encoded question vector and the encoded image vector? 495 00:49:25,998 --> 00:49:30,395 Yeah, so the question is how are the encoded image and the encoded question vector combined? 496 00:49:30,395 --> 00:49:34,847 Kind of the simplest thing to do is just to concatenate them and stick them into fully connected layers. 497 00:49:34,847 --> 00:49:37,864 That's probably the most common and that's probably the first thing to try. 498 00:49:37,864 --> 00:49:44,389 Sometimes people do slightly fancier things where they might try to have multiplicative interactions between those two vectors to allow a more powerful function. 499 00:49:44,389 --> 00:49:48,050 But generally, concatenation is kind of a good first thing to try. 500 00:49:49,426 --> 00:49:55,084 Okay, so now we've talked about a bunch of scenarios where RNNs are used for different kinds of problems. And I think it's super cool 501 00:49:55,084 --> 00:50:04,072 because it allows you to start tackling really complicated problems combining images and computer vision with natural language processing. 502 00:50:04,072 --> 00:50:08,870 And you can see that we can kind of stitch together these models like Lego blocks and attack really complicated things, 503 00:50:08,870 --> 00:50:16,736 like image captioning or visual question answering, just by stitching together these relatively simple types of neural network modules. 504 00:50:18,708 --> 00:50:26,274 But I'd also like to mention that so far, we've talked about this idea of a single recurrent network layer, where we have sort of one hidden state, 505 00:50:26,274 --> 00:50:32,581 and another thing that you'll see pretty commonly is this idea of a multilayer recurrent neural network. 506 00:50:32,581 --> 00:50:44,554 Here, this is a three layer recurrent neural network, so now our input goes in and produces a sequence of hidden states from the first recurrent neural network layer. 507 00:50:44,554 --> 00:50:50,468 And now, after we run kind of one recurrent neural network layer, then we have this whole sequence of hidden states.
508 00:50:50,468 --> 00:50:56,474 And now, we can use that sequence of hidden states as an input sequence to another recurrent neural network layer, 509 00:50:56,474 --> 00:51:01,577 which will then produce another sequence of hidden states from the second RNN layer. 510 00:51:01,577 --> 00:51:04,181 And then you can just imagine stacking these things on top of each other, 511 00:51:04,181 --> 00:51:09,398 because we've seen in other contexts that deeper models tend to perform better for various problems. 512 00:51:09,398 --> 00:51:18,215 And the same kind of holds in RNNs as well. For many problems, you'll see maybe a two or three layer recurrent neural network model is pretty commonly used. 513 00:51:18,215 --> 00:51:22,086 You typically don't see super deep models in RNNs. 514 00:51:22,086 --> 00:51:29,327 So generally, like two, three, four layer RNNs is maybe as deep as you'll typically go. 515 00:51:29,327 --> 00:51:38,222 Then, I think it's also really interesting and important to think about: now we've seen what kinds of problems these RNNs can be used for, 516 00:51:38,222 --> 00:51:43,229 but then you need to think a little bit more carefully about exactly what happens to these models when we try to train them. 517 00:51:43,229 --> 00:51:47,447 So here, I've drawn this little vanilla RNN cell that we've talked about so far. 518 00:51:47,447 --> 00:51:55,307 So here, we're taking our current input, x t, and our previous hidden state, h t minus one, and those are two vectors. 519 00:51:55,307 --> 00:52:05,783 So we can just stack them together, perform this matrix multiplication with our weight matrix, and then squash that output through a tanh, and that will give us our next hidden state. 520 00:52:05,783 --> 00:52:10,131 And that's kind of the basic functional form of this vanilla recurrent neural network. 521 00:52:10,131 --> 00:52:17,738 But then, we need to think about what happens in this architecture during the backward pass when we try to compute gradients. 522 00:52:17,738 --> 00:52:27,477 So then during the backward pass, we'll receive the derivative of the loss with respect to h t. 523 00:52:28,568 --> 00:52:34,632 And during the backward pass through the cell, we'll need to compute the derivative of the loss with respect to h t minus one. 524 00:52:34,632 --> 00:52:39,881 Then, when we compute this backward pass, we see that the gradient flows backward through this red path. 525 00:52:39,881 --> 00:52:45,958 So first, that gradient will flow backwards through this tanh gate, and then it will flow backwards through this matrix multiplication gate. 526 00:52:45,958 --> 00:52:54,711 And then, as we've seen in the homework and when implementing these matrix multiplication layers, when you backpropagate through this matrix multiplication gate, 527 00:52:54,711 --> 00:52:58,457 you end up multiplying by the transpose of that weight matrix. 528 00:52:58,457 --> 00:53:08,024 So that means that every time we backpropagate through one of these vanilla RNN cells, we end up multiplying by some part of the weight matrix. 529 00:53:09,110 --> 00:53:16,599 So now imagine that we are sticking many of these recurrent neural network cells in sequence, because again this is an RNN and we want to model sequences.
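Here is a minimal sketch of one such cell's forward and backward step, written for single vectors; the variable names and shapes are assumptions chosen to match the description above.

```python
import numpy as np

def rnn_step_forward(x_t, h_prev, Wxh, Whh, b):
    """Vanilla RNN cell: the stack-and-multiply form written out as two matmuls."""
    a = Wxh @ x_t + Whh @ h_prev + b       # pre-activation
    h_t = np.tanh(a)
    return h_t, (x_t, h_prev, h_t)

def rnn_step_backward(dh_t, cache, Wxh, Whh):
    """Backward through one cell. Note the multiplication by Whh.T: this is the
    factor picked up every time we backpropagate through a time step."""
    x_t, h_prev, h_t = cache
    da = dh_t * (1.0 - h_t ** 2)            # backprop through tanh
    dWxh = np.outer(da, x_t)
    dWhh = np.outer(da, h_prev)
    db = da
    dh_prev = Whh.T @ da                    # the W-transpose factor discussed above
    return dh_prev, dWxh, dWhh, db
```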
530 00:53:16,599 --> 00:53:20,921 Now if you imagine what happens to the gradient flow through a sequence of these layers, 531 00:53:20,921 --> 00:53:23,770 then something kind of fishy starts to happen. 532 00:53:23,770 --> 00:53:30,810 Because now, when we want to compute the gradient of the loss with respect to h zero, we need to backpropagate through every one of these RNN cells. 533 00:53:30,810 --> 00:53:35,641 And every time you backpropagate through one cell, you'll pick up one of these w transpose factors. 534 00:53:35,641 --> 00:53:44,967 So that means that the final expression for the gradient on h zero will involve many, many factors of this weight matrix, which could be kind of bad. 535 00:53:44,967 --> 00:53:50,138 Don't think about the matrix case for a moment, but imagine the scalar case. 536 00:53:50,138 --> 00:53:54,563 If we have some scalar and we multiply by that same number over and over and over again, 537 00:53:54,563 --> 00:54:00,269 maybe not for just four time steps, but for something like a hundred or several hundred time steps, 538 00:54:00,269 --> 00:54:03,835 then multiplying by the same number over and over again is really bad. 539 00:54:03,835 --> 00:54:14,069 In the scalar case, it's either going to explode in the case that that number is greater than one, or it's going to vanish towards zero in the case that the number is less than one in absolute value. 540 00:54:14,069 --> 00:54:18,186 And the only way in which this will not happen is if that number is exactly one, 541 00:54:18,186 --> 00:54:20,911 which is actually very rare to happen in practice. 542 00:54:20,911 --> 00:54:25,229 That same intuition extends to the matrix case, 543 00:54:25,229 --> 00:54:32,071 but now, rather than the absolute value of a scalar number, you instead need to look at the largest singular value of this weight matrix. 544 00:54:32,071 --> 00:54:46,079 Now if that largest singular value is greater than one, then during this backward pass, when we multiply by the weight matrix over and over, that gradient on h zero will become very, very large. 545 00:54:46,079 --> 00:54:48,947 And that's something we call the exploding gradient problem, 546 00:54:48,947 --> 00:54:55,061 where now this gradient will explode exponentially with the number of time steps that we backpropagate through. 547 00:54:55,061 --> 00:55:05,349 And if the largest singular value is less than one, then we get the opposite problem, where now our gradients will shrink and shrink and shrink exponentially, as we backpropagate and pick up more and more factors of this weight matrix. 548 00:55:05,349 --> 00:55:08,863 That's called the vanishing gradient problem. 549 00:55:08,863 --> 00:55:14,208 There's a bit of a hack that people sometimes do to fix the exploding gradient problem called gradient clipping, 550 00:55:14,208 --> 00:55:19,713 which is just this simple heuristic saying that after we compute our gradient, if that gradient, 551 00:55:19,713 --> 00:55:27,955 if its L2 norm is above some threshold, then just scale it down so that its norm equals that maximum threshold. 552 00:55:27,955 --> 00:55:32,645 This is kind of a nasty hack, but it actually gets used in practice quite a lot when training recurrent neural networks. 553 00:55:32,645 --> 00:55:39,034 And it's a relatively useful tool for attacking this exploding gradient problem.
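A minimal sketch of that clipping heuristic, assuming the gradient has already been flattened into a single NumPy array; the threshold value here is just an illustrative choice, not a universal constant.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """If the gradient's L2 norm exceeds a threshold, rescale it so that its
    norm equals the threshold; otherwise leave it untouched."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```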
554 00:55:39,034 --> 00:55:46,207 But now for the vanishing gradient problem, what we typically do is move to a more complicated RNN architecture. 555 00:55:46,207 --> 00:55:53,524 So that motivates this idea of an LSTM. An LSTM, which stands for Long Short Term Memory, 556 00:55:53,524 --> 00:55:58,316 is this slightly fancier recurrence relation for these recurrent neural networks. 557 00:55:58,316 --> 00:56:03,330 It's really designed to help alleviate this problem of vanishing and exploding gradients. 558 00:56:03,330 --> 00:56:10,193 So that rather than kind of hacking on top of it, we just design the architecture to have better gradient flow properties. 559 00:56:10,193 --> 00:56:16,556 It's kind of analogous to those fancier CNN architectures that we saw at the top of the lecture. 560 00:56:16,556 --> 00:56:20,807 Another thing to point out is that the LSTM cell actually comes from 1997. 561 00:56:20,807 --> 00:56:24,073 So this idea of an LSTM has been around for quite a while, 562 00:56:24,073 --> 00:56:28,852 and these folks, who were working on these ideas way back in the 90s, were definitely ahead of the curve, 563 00:56:28,852 --> 00:56:32,475 because these models are kind of used everywhere now, 20 years later. 564 00:56:33,864 --> 00:56:37,921 And LSTMs kind of have this funny functional form. 565 00:56:37,921 --> 00:56:46,397 So remember when we had this vanilla recurrent neural network, it had this hidden state. And we used this recurrence relation to update the hidden state at every time step. 566 00:56:46,397 --> 00:56:51,462 Well, now in an LSTM, we actually maintain two hidden states at every time step. 567 00:56:51,462 --> 00:56:59,305 One is this h t, which is called the hidden state, which is kind of an analogy to the hidden state that we had in the vanilla RNN. 568 00:56:59,305 --> 00:57:03,604 But an LSTM also maintains a second vector, c t, called the cell state. 569 00:57:03,604 --> 00:57:12,240 And the cell state is this vector which is kind of internal, kept inside the LSTM, and it does not really get exposed to the outside world. 570 00:57:12,240 --> 00:57:15,260 And you can kind of see that through these update equations, 571 00:57:15,260 --> 00:57:20,803 where first we take our two inputs, 572 00:57:20,803 --> 00:57:25,485 and we use them to compute these four gates called i, f, o, and g. 573 00:57:25,485 --> 00:57:33,802 We use those gates to update our cell state, c t, and then we expose part of our cell state as the hidden state at the next time step. 574 00:57:36,704 --> 00:57:44,014 This is kind of a funny functional form, and I want to walk through for a couple slides exactly why we use this architecture and why it makes sense, 575 00:57:44,014 --> 00:57:47,731 especially in the context of vanishing or exploding gradients. 576 00:57:47,731 --> 00:57:57,213 The first thing that we do in an LSTM is that we're given the previous hidden state, h t minus one, and we're given our current input vector, x t, 577 00:57:57,213 --> 00:57:58,611 just like in the vanilla RNN. 578 00:57:58,611 --> 00:58:08,663 In the vanilla RNN, remember, we took those two input vectors. We concatenated them. Then we did a matrix multiply to directly compute the next hidden state in the RNN. 579 00:58:08,663 --> 00:58:15,315 Now, the LSTM does something a little bit different.
We're going to take our previous hidden state and our current input, stack them, 580 00:58:15,315 --> 00:58:21,926 and now multiply by a very big weight matrix, w, to compute four different gates, 581 00:58:21,926 --> 00:58:24,391 which all have the same size as the hidden state. 582 00:58:24,391 --> 00:58:30,808 Sometimes, you'll see this written in different ways. Some authors will write a different weight matrix for each gate. 583 00:58:30,808 --> 00:58:33,160 Some authors will combine them all into one big weight matrix. 584 00:58:33,160 --> 00:58:34,677 But it's all really the same thing. 585 00:58:34,677 --> 00:58:40,288 The idea is that we take our hidden state, our current input, and then we use those to compute these four gates. 586 00:58:40,288 --> 00:58:47,768 You often see these four gates written as i, f, o, g (ifog), which makes it pretty easy to remember what they are. 587 00:58:47,768 --> 00:58:53,655 I is the input gate. It says how much we want to input into our cell. 588 00:58:53,655 --> 00:59:00,194 F is the forget gate. How much do we want to forget the cell memory from the previous time step. 589 00:59:00,194 --> 00:59:04,653 O is the output gate, which is how much we want to reveal the cell to the outside world. 590 00:59:04,653 --> 00:59:10,130 And G really doesn't have a nice name, so I usually call it the gate gate. 591 00:59:10,130 --> 00:59:14,626 It tells us how much we want to write into our cell. 592 00:59:14,626 --> 00:59:19,665 And then you notice that each of these four gates is using a different non linearity. 593 00:59:21,724 --> 00:59:25,047 The input, forget and output gates are all using sigmoids, 594 00:59:25,047 --> 00:59:28,571 which means that their values will be between zero and one. 595 00:59:28,571 --> 00:59:34,316 Whereas the gate gate uses a tanh, which means its output will be between minus one and one. 596 00:59:34,316 --> 00:59:41,725 So, these are kind of weird, but it makes a little bit more sense if you imagine them all as binary values. 597 00:59:41,725 --> 00:59:44,744 Right, like what happens at the extremes of these two values? 598 00:59:46,033 --> 00:59:53,926 After we compute these gates, if you look at this next equation, you can see that our cell state is being multiplied element wise by the forget gate. 599 00:59:53,926 --> 01:00:00,350 Sorry, our cell state from the previous time step is being multiplied element wise by this forget gate. And now if this forget gate, 600 01:00:00,350 --> 01:00:03,861 you can think of it as being a vector of zeros and ones, 601 01:00:03,861 --> 01:00:10,644 that's telling us for each element in the cell state, do we want to forget that element of the cell, in the case that the forget gate is zero, 602 01:00:10,644 --> 01:00:14,935 or do we want to remember that element of the cell, in the case that the forget gate is one. 603 01:00:14,935 --> 01:00:19,767 Now, once we've used the forget gate to gate off part of the cell state, 604 01:00:19,767 --> 01:00:24,814 then we have the second term, which is the element wise product of i and g.
605 01:00:24,814 --> 01:00:28,824 So now, i is this vector of zeros and ones, because it's coming through a sigmoid, 606 01:00:28,824 --> 01:00:35,728 telling us for each element of the cell state, do we want to write to that element of the cell state, in the case that i is one, 607 01:00:35,728 --> 01:00:41,719 or do we not want to write to that element of the cell state at this time step, in the case that i is zero. 608 01:00:41,719 --> 01:00:46,031 And now the gate gate, because it's coming through a tanh, will be between minus one and one. 609 01:00:46,031 --> 01:00:54,022 So that is the candidate value that we might consider writing to each element of the cell state at this time step. 610 01:00:54,022 --> 01:01:02,499 Then if you look at the cell state equation, you can see that at every time step, the cell state has these different, independent scalar values, 611 01:01:02,499 --> 01:01:05,230 and they're all being incremented or decremented by up to one. 612 01:01:05,230 --> 01:01:11,028 So inside the cell state, we can either remember or forget our previous state, 613 01:01:11,028 --> 01:01:16,535 and then we can either increment or decrement each element of that cell state by up to one at each time step. 614 01:01:16,535 --> 01:01:25,708 So you can kind of think of these elements of the cell state as being little scalar counters that can be incremented and decremented at each time step. 615 01:01:25,708 --> 01:01:36,513 And now, after we've computed our cell state, then we use our updated cell state to compute a hidden state, which we will reveal to the outside world. 616 01:01:36,513 --> 01:01:43,353 So because this cell state has this interpretation of being counters, sort of counting up or down at each time step, 617 01:01:43,353 --> 01:01:51,826 we want to squash that counter value into a nice minus one to one range using a tanh. And now, we multiply element wise by this output gate. 618 01:01:51,826 --> 01:01:57,597 And the output gate is again coming through a sigmoid, so you can think of it as being mostly zeros and ones, 619 01:01:57,597 --> 01:02:08,577 and the output gate tells us for each element of our cell state, do we want to reveal or not reveal that element of our cell state when we're computing the external hidden state for this time step. 620 01:02:09,736 --> 01:02:17,132 And then, I think there's kind of a tradition in people trying to explain LSTMs, that everyone needs to come up with their own potentially confusing LSTM diagram. 621 01:02:17,132 --> 01:02:18,882 So here's my attempt. 622 01:02:20,380 --> 01:02:23,866 Here, we can see what's going on inside this LSTM cell, 623 01:02:23,866 --> 01:02:31,266 is that we're taking as input on the left our previous cell state and the previous hidden state, as well as our current input, x t. 624 01:02:31,266 --> 01:02:37,346 Now we're going to take our previous hidden state, as well as our current input, stack them, 625 01:02:37,346 --> 01:02:41,166 and then multiply with this weight matrix, w, to produce our four gates. 626 01:02:41,166 --> 01:02:44,836 And here, I've left out the non linearities because we saw those on a previous slide. 627 01:02:44,836 --> 01:02:48,143 And now the forget gate multiplies element wise with the cell state. 628 01:02:48,143 --> 01:02:54,524 The input and gate gate are multiplied element wise and added to the cell state. And that gives us our next cell.
629 01:02:54,524 --> 01:03:01,366 The next cell gets squashed through a tanh, and multiplied element wise with this output gate to produce our next hidden state. 630 01:03:02,417 --> 01:03:03,250 Question? 631 01:03:13,116 --> 01:03:17,878 No. They're coming from different parts of this weight matrix. 632 01:03:17,878 --> 01:03:26,415 So if our x and our h both have this dimension h, then after we stack them, they'll be a vector of size two h, 633 01:03:26,415 --> 01:03:30,393 and now our weight matrix will be this matrix of size four h times two h. 634 01:03:30,393 --> 01:03:41,511 So you can think of that as sort of having four chunks of this weight matrix. And each of these four chunks of the weight matrix is going to compute a different one of these gates. 635 01:03:42,404 --> 01:03:51,109 You'll often see this written for clarity, kind of combining all four of those different weight matrices into a single large matrix, w, just for notational convenience. 636 01:03:51,109 --> 01:03:56,067 But they're all computed using different parts of the weight matrix. 637 01:03:57,080 --> 01:04:04,574 But you're correct in that they're all computed using the same functional form of just stacking the two things and taking the matrix multiplication. 638 01:04:04,574 --> 01:04:11,196 Now that we have this picture, we can think about what happens to an LSTM cell during the backwards pass. 639 01:04:11,196 --> 01:04:18,999 We saw, in the context of the vanilla recurrent neural network, that some bad things happened during the backwards pass, where we were continually multiplying by that weight matrix, w. 640 01:04:18,999 --> 01:04:23,238 But now, the situation looks quite a bit different in the LSTM. 641 01:04:23,238 --> 01:04:29,952 If you imagine this path backwards of computing the gradients of the cell state, we get quite a nice picture. 642 01:04:29,952 --> 01:04:33,320 Now, when we have our upstream gradient from the cell coming in, 643 01:04:33,320 --> 01:04:37,737 then once we backpropagate backwards through this addition operation, 644 01:04:37,737 --> 01:04:43,663 remember that this addition just copies that upstream gradient into the two branches, 645 01:04:43,663 --> 01:04:50,435 so our upstream gradient gets copied and passed directly into backpropagating through this element wise multiply. 646 01:04:50,435 --> 01:04:56,455 So then our upstream gradient ends up getting multiplied element wise by the forget gate. 647 01:04:56,455 --> 01:05:07,171 As we backpropagate backwards through this cell state, the only thing that happens to our upstream cell state gradient is that it ends up getting multiplied element wise by the forget gate. 648 01:05:07,171 --> 01:05:12,640 This is really a lot nicer than the vanilla RNN for two reasons. 649 01:05:12,640 --> 01:05:18,498 One is that this forget gate is now an element wise multiplication rather than a full matrix multiplication. 650 01:05:18,498 --> 01:05:24,964 So element wise multiplication is going to be a little bit nicer than full matrix multiplication. 651 01:05:24,964 --> 01:05:31,354 Second is that element wise multiplication will potentially be multiplying by a different forget gate at every time step. 652 01:05:31,354 --> 01:05:36,660 So remember, in the vanilla RNN, we were continually multiplying by that same weight matrix over and over again, 653 01:05:36,660 --> 01:05:40,563 which led very explicitly to these exploding or vanishing gradients.
654 01:05:40,563 --> 01:05:45,161 But now in the LSTM case, this forget gate can vary at each time step. 655 01:05:45,161 --> 01:05:51,670 Now, it's much easier for the model to avoid these problems of exploding and vanishing gradients. 656 01:05:51,670 --> 01:05:58,438 Finally, because this forget gate is coming out of a sigmoid, its values are guaranteed to be between zero and one, 657 01:05:58,438 --> 01:06:04,278 which again leads to sort of nicer numerical properties if you imagine multiplying by these things over and over again. 658 01:06:04,278 --> 01:06:14,273 Another thing to notice is that in the context of the vanilla recurrent neural network, we saw that during the backward pass, our gradients were also flowing through a tanh at every time step. 659 01:06:14,273 --> 01:06:24,110 But now, in an LSTM, our hidden state is used to compute those outputs, y t, 660 01:06:24,110 --> 01:06:33,024 so now, if you imagine backpropagating from the final hidden state back to the first cell state, then through that backward path, 661 01:06:33,024 --> 01:06:42,044 we only backpropagate through a single tanh non linearity rather than through a separate tanh at every time step. 662 01:06:42,044 --> 01:06:50,483 So when you put all these things together, you can see this backwards pass backpropagating through the cell state is kind of a gradient super highway 663 01:06:50,483 --> 01:06:59,172 that lets gradients pass relatively unimpeded from the loss at the very end of the model all the way back to the initial cell state at the beginning of the model. 664 01:06:59,172 --> 01:07:00,922 Was there a question? 665 01:07:02,901 --> 01:07:06,792 Yeah, what about the gradient with respect to w? 'Cause that's ultimately the thing that we care about. 666 01:07:06,792 --> 01:07:19,848 So, the gradient with respect to w: at every time step, we'll take our current cell state as well as our current hidden state, and that will give us our local gradient on w for that time step. 667 01:07:19,848 --> 01:07:29,587 Just as in the vanilla RNN case, we'll end up adding those per-time-step w gradients to compute our final gradient on w. 668 01:07:29,587 --> 01:07:37,280 But now, imagine the situation where we have a very long sequence, and we're only getting gradients at the very end of the sequence. 669 01:07:37,280 --> 01:07:43,219 Now, as we backpropagate through, we'll get a local gradient on w for each time step, 670 01:07:43,219 --> 01:07:48,506 and that local gradient on w will be coming through these gradients on c and h. 671 01:07:48,506 --> 01:07:52,751 So because we're maintaining the gradients on c much more nicely in the LSTM case, 672 01:07:52,751 --> 01:07:59,804 those local gradients on w at each time step will also be carried forward and backward through time much more cleanly. 673 01:08:01,627 --> 01:08:03,044 Another question? 674 01:08:17,428 --> 01:08:22,088 Yeah, so the question is due to the non linearities, could this still be susceptible to vanishing gradients? 675 01:08:22,089 --> 01:08:34,103 And that could be the case. One problem you might imagine is that if these forget gates are always less than one, you might get vanishing gradients as you continually go through these forget gates.
676 01:08:34,103 --> 01:08:42,746 Well, one sort of trick that people do in practice is to sometimes initialize the biases of the forget gate to be somewhat positive, 677 01:08:42,746 --> 01:08:46,305 so that at the beginning of training, those forget gates are always very close to one. 678 01:08:46,305 --> 01:08:56,631 So at least at the beginning of training, we have relatively clean gradient flow through these forget gates, since they're all initialized to be near one. 679 01:08:56,631 --> 01:09:02,741 And then throughout the course of training, the model can learn those biases and kind of learn to forget where it needs to. 680 01:09:02,742 --> 01:09:09,182 You're right that there still could be some potential for vanishing gradients here. But it's much less extreme than the vanilla RNN case, 681 01:09:09,182 --> 01:09:12,126 both because those fs can vary at each time step, 682 01:09:12,126 --> 01:09:19,126 and also because we're doing this element wise multiplication rather than a full matrix multiplication. 683 01:09:19,126 --> 01:09:23,048 So you can see that this LSTM actually looks quite similar to ResNet. 684 01:09:23,048 --> 01:09:28,069 In this residual network, we had this path of identity connections going backward through the network, 685 01:09:28,069 --> 01:09:32,303 and that gave sort of a gradient super highway for gradients to flow backward in ResNet. 686 01:09:32,303 --> 01:09:34,924 And now it's kind of the same intuition in the LSTM, 687 01:09:34,924 --> 01:09:45,244 where these additive and element wise multiplicative interactions of the cell state can give a similar gradient super highway for gradients to flow backwards through the cell state in an LSTM. 688 01:09:46,343 --> 01:09:55,558 And by the way, there's this other kind of nice paper called highway networks, which is kind of in between this idea of this LSTM cell and these residual networks. 689 01:09:57,796 --> 01:10:01,165 So these highway networks actually came before residual networks, 690 01:10:01,165 --> 01:10:08,804 and they had this idea where at every layer of the highway network, we're going to compute sort of a candidate activation, as well as a gating function 691 01:10:08,804 --> 01:10:17,017 that interpolates between our previous input at that layer and that candidate activation that came through our convolutions or whatnot. 692 01:10:17,017 --> 01:10:20,472 So there's actually a lot of architectural similarities between these things, 693 01:10:20,472 --> 01:10:27,688 and people take a lot of inspiration from training very deep CNNs and very deep RNNs and there's a lot of crossover here. 694 01:10:27,688 --> 01:10:34,675 Very briefly, you'll see a lot of other types of variants of recurrent neural network architectures out there in the wild. 695 01:10:34,675 --> 01:10:40,853 Probably the most common, apart from the LSTM, is this GRU, called the gated recurrent unit. 696 01:10:40,853 --> 01:10:45,701 And you can see those update equations here, and it kind of has this similar flavor of the LSTM, 697 01:10:45,701 --> 01:10:53,828 where it uses these multiplicative element wise gates together with these additive interactions to avoid this vanishing gradient problem.
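To make these update rules concrete, here is a minimal sketch of one LSTM step and one GRU step in NumPy. The weight layout (one stacked matrix of shape (4H, D + H) for the four LSTM gates) and the particular GRU gate convention are assumptions; different papers write these slightly differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, D + H): four stacked chunks, one per
    gate (i, f, o, g), as discussed in the question above."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(a[0:H])          # input gate:  write into the cell?
    f = sigmoid(a[H:2*H])        # forget gate: keep the old cell contents?
    o = sigmoid(a[2*H:3*H])      # output gate: reveal the cell to the hidden state?
    g = np.tanh(a[3*H:4*H])      # candidate values to write, in [-1, 1]
    c_t = f * c_prev + i * g     # element wise gating plus additive update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step under one common convention (some papers swap the roles
    of z and 1 - z). There is no separate cell state here."""
    z = sigmoid(Wz @ np.concatenate([x_t, h_prev]) + bz)        # update gate
    r = sigmoid(Wr @ np.concatenate([x_t, h_prev]) + br)        # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x_t, r * h_prev]) + bh)
    h_t = (1.0 - z) * h_prev + z * h_tilde                       # interpolation
    return h_t
```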
698 01:10:53,828 --> 01:10:59,734 There's also this cool paper called LSTM: A Search Space Odyssey, very inventive title, 699 01:10:59,734 --> 01:11:04,980 where they played around with the LSTM equations and swapped out the non linearities, 700 01:11:04,980 --> 01:11:07,343 like, do we really need that tanh when exposing the cell through the output gate, 701 01:11:07,343 --> 01:11:14,017 and they tried to answer a lot of these different questions about each of those non linearities, each of those pieces of the LSTM update equations. 702 01:11:14,017 --> 01:11:18,068 What happens if we change the model and tweak those LSTM equations a little bit? 703 01:11:18,068 --> 01:11:24,521 And kind of the conclusion is that they all work about the same. Some of them work a little bit better than others for one problem or another. 704 01:11:24,521 --> 01:11:32,148 But generally, none of the tweaks of the LSTM that they tried were significantly better than the original LSTM across all problems. 705 01:11:32,148 --> 01:11:41,084 So that gives you a little bit more faith: the LSTM update equations seem kind of magical, but they're useful anyway, and you should probably consider them for your problem. 706 01:11:41,084 --> 01:11:43,692 There's also this cool paper from Google a couple years ago 707 01:11:43,692 --> 01:11:52,605 where they did kind of an evolutionary search over a very large number of random RNN architectures, 708 01:11:52,605 --> 01:12:00,860 where they kind of randomly permute these update equations and try putting the additions and the multiplications and the gates and the non linearities together in different kinds of combinations. 709 01:12:00,860 --> 01:12:08,570 They blasted this out over their huge Google cluster and just tried a whole bunch of these different update equations in various flavors. 710 01:12:08,570 --> 01:12:15,446 And again, it was the same story that they didn't really find anything that was significantly better than these existing GRU or LSTM styles. 711 01:12:15,446 --> 01:12:19,518 Although there were some variations that worked maybe slightly better or worse for certain problems. 712 01:12:19,518 --> 01:12:27,080 But kind of the takeaway is that there's probably not so much magic in the exact LSTM or GRU equations, 713 01:12:27,080 --> 01:12:33,356 but this idea of managing gradient flow properly through these additive connections and these multiplicative gates is super useful. 714 01:12:34,888 --> 01:12:40,103 So yeah, the summary is that RNNs are super cool. They can allow you to attack tons of new types of problems. 715 01:12:40,103 --> 01:12:43,431 They sometimes are susceptible to vanishing or exploding gradients. 716 01:12:43,431 --> 01:12:47,412 But we can address that with gradient clipping and with fancier architectures. 717 01:12:47,412 --> 01:12:52,043 And there's a lot of cool overlap between CNN architectures and RNN architectures. 718 01:12:52,043 --> 01:13:02,801 So next time, you'll be taking the midterm. But after that, we'll have a, sorry, a question? The midterm is after this lecture, so anything up to this point is fair game.